Machine Translation
Towards Federated Foundation Models: Scalable Dataset Pipelines for Group-Structured Learning
Zachary Charles
We introduce Dataset Grouper, a library to create large-scale group-structured (e.g., federated) datasets, enabling federated learning simulation at the scale of foundation models. This library facilitates the creation of group-structured versions of existing datasets based on user-specified partitions, and directly leads to a variety of useful heterogeneous datasets that can be plugged into existing software frameworks. Dataset Grouper offers three key advantages. First, it scales to settings where even a single group's dataset is too large to fit in memory. Second, it provides flexibility, both in choosing the base (non-partitioned) dataset and in defining partitions.
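The idea of partitioning a flat dataset by a user-specified function, without ever holding a whole group in memory, can be illustrated with a small standard-library sketch. This is not Dataset Grouper's API: the names `partition_to_disk`, `stream_group`, and `get_group_id`, the JSON Lines layout, and the sample records are all assumptions made for illustration, and the sketch further assumes that group ids are safe to use as file names.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator, Set


def partition_to_disk(
    examples: Iterable[dict],
    get_group_id: Callable[[dict], str],  # hypothetical user-specified partition function
    out_dir: Path,
) -> Set[str]:
    """Append each example to a per-group JSON Lines file; return the group ids seen."""
    out_dir.mkdir(parents=True, exist_ok=True)
    group_ids: Set[str] = set()
    for example in examples:
        group_id = get_group_id(example)
        group_ids.add(group_id)
        # Re-opening the file per example is slow, but only one example
        # is held in memory at any time.
        with open(out_dir / f"{group_id}.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(example) + "\n")
    return group_ids


def stream_group(out_dir: Path, group_id: str) -> Iterator[dict]:
    """Lazily yield one group's examples, one line at a time."""
    with open(out_dir / f"{group_id}.jsonl", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


# Made-up records, partitioned by a made-up "domain" field.
examples = [
    {"domain": "a.example", "text": "hello"},
    {"domain": "b.example", "text": "bonjour"},
    {"domain": "a.example", "text": "world"},
]
out_dir = Path(tempfile.mkdtemp())
group_ids = partition_to_disk(examples, lambda ex: ex["domain"], out_dir)
texts_a = [ex["text"] for ex in stream_group(out_dir, "a.example")]
```

Because `stream_group` is a generator, a consumer can iterate over an arbitrarily large group while reading one line at a time. Changing the partition only requires passing a different `get_group_id` function, which mirrors the flexibility in defining partitions that the abstract describes.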
A Appendix
The complete list may be seen in Table 8. Here are a few general notes about these strings: [...] Based on their recommendations, we did the following: [...] zh, zh_Latn: This resulted in the special filters described below. [...] URLs) the corpora were in languages different from the LangID predictions. This is mainly mis-rendered PDFs and may have practical applications for denoising, or for decoding such garbled PDFs.